SCRAPING TENNIS WEBSITES

Data Harvesting UC3M

Irantzu Lamarca, Diego Fernandez

2025-03-27

Objectives of the project

  • Gather information from the official ATP website (rankings, player statistics…)

  • Build a scrapper to extract the data (Selenium)

  • Create a csv file and perform some analysis with it (Shiny)

Libraries used

library(scrapex)
library(rvest)
library(httr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(xml2)
library(lubridate)
library(tidyverse)
library(RSelenium)

Scraping the atp rankings

  • Atp_link <- “https://www.atptour.com/es/rankings/singles?rankRange=0-100&region=all&dateWeek=Semana%20Actual&SortField=Ranking&SortAscending=True”
  • The website does not let you scrape it unless an user-agent is set
  • 1st attempt: atp_tables[[1]] -> did not work

ATP Ranking

  • 2nd attempt
  • top 300 players
xml_find_all(atp_website,"//span[@class='lastName']")

#<!-- //span: Find all <span> elements -->
#<!-- [@class='lastName']: only with the class lastName -->

Scraping the aces of each player

Aces table:

Scraping the rest of the variables

  • Same process as with the aces table
  • After cleaning all the tables, join by ranking number
  • Additional 2 variables extracted from another website

Other scraped variables

  • Wins and looses
  • Points won with first serve
  • % of the wins of the player when he has to serve
  • % of the wins of the player when he has to receive the serve
  • % of break points saved

Age + country

  • Age and country of the player from another website -> ATP website was not letting us do it
age_country_link <- "https://tennisabstract.com/reports/atpRankings.html" |>
  read_html()

age_country <- age_country_link |> html_table()

atp_table <- age_country[[1]]
atp_table

# Let´s clean the table:

atp_table <- atp_table[-1, ]
atp_table <- atp_table[-1, ]
atp_table <- atp_table[, 1:4]

colnames(atp_table) <- c("Rank", "Player", "Country", "Birthdate")

# Keep only the first 300 rows
atp_table <- atp_table |> slice(1:300)

# Calculate age from birthdate
atp_table <- atp_table |> 
  mutate(Birthdate = as.Date(Birthdate),
         Age = floor(interval(Birthdate, today()) / years(1)))


# Keep only Rank, Last_Name, and Age
atp_table <- atp_table |>  select(Rank, Player, Age, Country)
atp_table$Rank <- as.numeric(atp_table$Rank)

Age + country table:

Final dataset

ranking name %break_points_saved points_won total_points Matches_played.x %games_won(receive) games_won_receiving total_games_receiving Matches_played.y %games_won(serve) games_won_serving total_games_serving Matches_played.x.x Aces_number Matches_played.y.y 1st_serve_win_% Points_won Total_points Matches_played Index Titles Wins Losses
1 Sinner 73.67% 235 319 76 28.30% 270 954 76 91.40% 893 977 76 563 76 563 76 NA NA NA NA NA NA
2 Zverev 69.73% 258 370 90 22.26% 254 1141 90 90.18% 1029 1141 90 797 90 797 90 NA NA 0.706 0 12 5
3 Alcaraz 61.85% 167 270 64 30.79% 218 708 64 85.47% 606 709 64 265 64 265 64 NA NA 0.833 1 15 3
4 Fritz 68.97% 240 348 74 20.00% 180 900 74 88.32% 817 925 74 725 74 725 74 NA NA 0.667 0 10 5
5 Djokovic 64.84% 142 219 45 28.95% 152 525 45 85.71% 462 539 45 275 45 275 45 NA NA 0.636 0 7 4
6 Ruud 62.97% 233 370 73 23.79% 201 845 73 83.90% 714 851 73 400 73 400 73 NA NA 0.769 0 10 3
7 Draper 60.68% 179 295 59 24.83% 183 737 59 84.15% 616 732 59 486 59 486 59 NA NA 0.867 1 13 2
8 Medvedev 61.03% 260 426 67 27.49% 229 833 67 79.78% 655 821 67 426 67 426 67 NA NA 0.667 0 12 6
9 Rublev 68.83% 265 385 69 20.42% 184 901 69 86.73% 784 904 69 578 69 578 69 NA NA 0.600 1 9 6
10 Tsitsipas 68.37% 227 332 66 21.49% 162 754 66 86.31% 662 767 66 421 66 421 66 NA NA 0.667 1 10 5

Usage of the data

  • Remove duplicate or irrelevant columns
  • Convert percentages to numeric values
  • Handle missing values
  • Convert data into correct format
# Delete “Matches_played” and “Total_points” columns because they have 100% of NA
df <- df %>% select(-Matches_played, -Total_points)

# Merge column data into one column, taking non-NA values
df$matches_played <- coalesce(df$Matches_played.x, df$Matches_played.y, df$Matches_played.x.x, df$Matches_played.y.y)

# Delete original columns
df <- df %>% select(-Matches_played.x, -Matches_played.y, -Matches_played.x.x, -Matches_played.y.y)

# Delete unwanted columns
df <- df %>% select(-Points_won, -total_points, -points_won)

Shiny app 1

  • ATP Player Comparator
  • Select a professional tennis player and compare their performance metrics with custom input values
  • Purpose: provide players, coaches, and analysts with a way to compare their performance to elite players, helping to identify strengths and areas for improvement

Shiny app 2

  • Tennis match simulator
  • Two simulation modes:
    • Monte Carlo simulation (predicts match outcomes over 1,000 iterations)
    • Set-based simulation that determines the winner set by set
  • Valuable tool for fans, analysts, and players who want to assess potential matchups and performance factors